this, we need large amounts of data (a “training data set”) and feedback to the neural network as to whether the computer’s prediction was correct or incorrect (generated by the network itself: unsupervised learning; supplied from the outside: supervised learning), even for individual molecules or sequences (predictions, for example, of the secondary structure of a protein, of its localisation in the cell, etc.).
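The idea of supervised learning with external feedback can be sketched with a single artificial neuron (a perceptron). The toy task and update rule below are illustrative and not taken from the text: the network’s prediction is compared with the known label, and the resulting error signal (the feedback) adjusts the weights.

```python
# Minimal sketch of supervised learning: a single artificial neuron receives
# external feedback (the known labels) and adjusts its weights accordingly.
# The data set (logical AND) is a toy example, chosen because it is linearly
# separable, so the perceptron is guaranteed to converge.

def train_perceptron(samples, labels, lr=0.1, epochs=20):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(samples, labels):
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = y - pred            # feedback: was the prediction correct?
            w[0] += lr * err * x1     # adjust weights in proportion to the error
            w[1] += lr * err * x2
            b += lr * err
    return w, b

X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]                      # labels for logical AND
w, b = train_perceptron(X, y)
preds = [1 if w[0] * x1 + w[1] * x2 + b > 0 else 0 for x1, x2 in X]
```

After training, `preds` reproduces the labels exactly; real networks apply the same feedback principle with many layers and far larger training sets.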
Genetic algorithms are a sophisticated search strategy that I myself have used enthusiastically for many years. Here, solutions are bred in the computer with the help of artificial evolution, through selection, mutation and recombination of digitally programmed chromosomes. These chromosomes encode the problem you want to solve. This works surprisingly well, given sufficiently large populations of individuals and several hundred generations of evolution. For example, with appropriate selection parameters one can obtain protein structures from the sequence with a small error relative to the observed structure (Dandekar and Argos 1994, 1996, 1997). The “catch” with this approach is how to encode the protein structure efficiently enough in the chromosomes (e.g., by “internal coordinates”) and how to design the selection “correctly” (many years of work, which then also requires a sufficient number of known, experimentally resolved crystal structures). Another clever search strategy
for complex problems with a huge, often high-dimensional search space is to imitate ants (ant colony optimization). Here, an anthill is programmed electronically, and the individual virtual ants scour the solution space, leaving behind a scent trail. In the computer, this trail is amplified into a virtual ant trail wherever particularly good solutions lie along the searched route. This method is also surprisingly powerful for complex problems, but it likewise requires a lot of patience until the real-world problem has been mapped into this virtual “forest of ants” well enough for the solutions to be tractable. A breakthrough in predicting the 3D structures of proteins was recently achieved by Senior et al. (2020) and Tunyasuvunakool et al. (2021).
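The genetic-algorithm loop described above (selection, recombination, mutation of digital chromosomes) can be sketched in a few lines. The fitness function below is a deliberately trivial stand-in (counting 1-bits, the classic “OneMax” toy problem), not the protein-structure scoring from the text; a real application would decode each chromosome into a candidate structure and score it against experimental data.

```python
import random

def evolve(fitness, length=12, pop_size=30, generations=100,
           mut_rate=0.08, seed=1):
    rng = random.Random(seed)
    # Each "chromosome" is a bit list encoding a candidate solution.
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]

    def tournament():
        # Selection: the fitter of two randomly drawn individuals survives.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    history = []  # best fitness per generation (never decreases, thanks to elitism)
    for _ in range(generations):
        elite = max(pop, key=fitness)
        history.append(fitness(elite))
        nxt = [elite[:]]                    # elitism: keep the current best
        while len(nxt) < pop_size:
            p1, p2 = tournament(), tournament()
            cut = rng.randrange(1, length)  # recombination: one-point crossover
            child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with small probability
            child = [bit ^ (rng.random() < mut_rate) for bit in child]
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness), history

# Toy fitness: the number of 1-bits in the chromosome ("OneMax").
best, history = evolve(sum)
```

Because the best individual is always carried over unchanged (elitism), the best fitness per generation is monotonically non-decreasing, which is one simple way to keep such a search well-behaved.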
14.3 Current Applications of Artificial Intelligence in Bioinformatics
The high-dimensional data in biology and medicine contain various variables (features), e.g. diagnosis, expression values, age, weight. In addition, there are complex relationships and correlations, but also confounders (confounding variables), batch effects and multicollinearity between the variables. In short, it is very time-consuming to find out which variables are relevant and which are not. An application from artificial intelligence research that has long been used in bioinformatics is machine learning (Tarca et al. 2007; Sommer and Gerlich 2013), both to structure the data and extract relevant features and to develop classification models (predictive models). We have already encountered PCA (Chap. 7) as a way to decompose high-dimensional data into principal components and reduce their complexity (dimensionality reduction). Other methods are cluster and regression analyses. While cluster analysis is used to classify data into groups (clusters) with similar characteristics, regression analysis is used to find correlations and relationships between variables.
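As a minimal illustration of regression analysis, the closed-form least-squares fit for a single explanatory variable can be written out directly. The toy data below (an expression value depending linearly on a dose) are invented for the example, not taken from the text.

```python
# Minimal sketch of regression analysis: ordinary least squares for one
# explanatory variable, fitted in closed form.

def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical example: expression value (y) as a linear function of dose (x)
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]          # exactly y = 2x + 1
slope, intercept = fit_line(xs, ys)
```

The fit recovers slope 2 and intercept 1; with noisy real data the same formula yields the best-fitting line in the least-squares sense, and multivariate versions extend the idea to many variables at once.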